Cython and Native Extensions
The Same Algorithm, Three Versions, 210x Apart
Before learning any Cython syntax, look at this benchmark. The algorithm is identical across all three implementations - a rolling sum of squares over an array:
# Version 1: Pure Python
def rolling_sum_squares_py(data: list, window: int) -> list:
    result = []
    n = len(data)
    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        result.append(total)
    return result

import time

data = list(range(10_000))
window = 100
start = time.perf_counter()
for _ in range(1000):
    rolling_sum_squares_py(data, window)
py_time = time.perf_counter() - start
print(f"Pure Python: {py_time:.3f}s")
Pure Python: 4.21s
Now the Cython version with just cdef type declarations - same algorithm, same .pyx file structure:
# rolling.pyx - Version 2: Cython with cdef types
def rolling_sum_squares_typed(list data, int window):
    cdef int n = len(data)
    cdef int i, j
    cdef double total
    cdef list result = []
    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        result.append(total)
    return result
Cython + cdef types: 0.18s (23x speedup)
And finally with typed memoryviews - the critical Cython feature that eliminates Python object overhead on array access:
# rolling.pyx - Version 3: Cython with typed memoryviews
import numpy as np
cimport numpy as cnp

def rolling_sum_squares_mv(
    cnp.ndarray[cnp.float64_t, ndim=1] data,
    int window
):
    cdef double[::1] data_mv = data  # typed memoryview
    cdef int n = data_mv.shape[0]
    cdef int i, j
    cdef double total
    cdef cnp.ndarray[cnp.float64_t, ndim=1] result = np.empty(n - window + 1)
    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data_mv[i + j] * data_mv[i + j]
        result[i] = total
    return result
Cython + typed memoryviews: 0.020s (210x speedup)
| Version | Time | Speedup | What Changed |
|---|---|---|---|
| Pure Python | 4.21 s | 1x | Baseline |
| Cython + cdef types | 0.18 s | 23x | Static C types for loop variables |
| Cython + typed memoryviews | 0.020 s | 210x | Direct C-level array access, no Python objects |
The 210x version uses the same nested loop, with no algorithmic change. Cython eliminated the Python object overhead on every array access and gave the C compiler enough type information to generate efficient machine code.
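Timings like these only mean something if every version returns the same values. A minimal sketch of an equivalence check in pure Python follows. The names rolling_reference, assert_same_output, and rolling_loop are illustrative and not part of the original benchmark; rolling_loop is a stand-in for whichever compiled function is being tested.

```python
import math


def rolling_reference(data, window):
    """Pure-Python reference: sum of squares over each full window."""
    return [
        sum(x * x for x in data[i:i + window])
        for i in range(len(data) - window + 1)
    ]


def assert_same_output(reference, candidate, data, window, rel_tol=1e-9):
    """Raise AssertionError if the two implementations disagree."""
    expected = reference(data, window)
    actual = list(candidate(data, window))
    assert len(expected) == len(actual), "output lengths differ"
    for idx, (e, a) in enumerate(zip(expected, actual)):
        assert math.isclose(e, a, rel_tol=rel_tol), f"mismatch at {idx}: {e} != {a}"


def rolling_loop(data, window):
    """Hypothetical candidate; swap in the compiled function here."""
    out = []
    for i in range(len(data) - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        out.append(total)
    return out


# Passes silently when both implementations agree.
assert_same_output(rolling_reference, rolling_loop, list(range(50)), 5)
```

Running the check on small inputs before every benchmark run keeps a fast but wrong version from being mistaken for a speedup.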
What You Will Learn
- Understand what Cython actually compiles to and how to read the annotation output
- Set up a Cython build using setup.py and pyproject.toml
- Declare static types with cdef, cpdef, and typed memoryviews
- Release the GIL and use parallel loops with prange
- Call C library functions from Cython
- Use ctypes and cffi as alternatives to full Cython compilation
- Know when NOT to use Cython
Prerequisites
| Requirement | Level Needed |
|---|---|
| Python functions and modules | Comfortable |
| NumPy array basics | Familiar |
| C types (int, double, pointer) | Basic awareness |
| gcc or clang on the system | Required |
Section 1: What Cython Actually Does
Cython is a superset of Python. Valid Python is valid Cython. The Cython compiler (the cython command) translates .pyx files into C code, which is then compiled by your system C compiler into a shared library (.so on Linux/macOS, .pyd on Windows) that Python can import.
my_module.pyx
│
│ cython my_module.pyx
▼
my_module.c ← ~5000 lines of generated C
│
│ gcc -shared -fPIC ... my_module.c -o my_module.so
▼
my_module.so ← importable from Python
│
│ import my_module
▼
Python call: my_module.rolling_sum_squares(data, 100)
The generated C code is real C - it handles Python reference counting, type checking at the boundaries, and calling conventions. What you write in .pyx determines how much of that overhead is present in the hot inner loop.
What the Generated C Looks Like
For a pure Python function in .pyx (no type declarations):
/* Generated C for: def add(x, y): return x + y */
static PyObject *__pyx_pw_6mymod_1add(PyObject *__pyx_self, PyObject *__pyx_args) {
    PyObject *__pyx_v_x = NULL;
    PyObject *__pyx_v_y = NULL;
    PyObject *__pyx_r = NULL;
    /* ... argument parsing ... */
    __pyx_r = PyNumber_Add(__pyx_v_x, __pyx_v_y);  // Python-level addition
    /* ... reference counting ... */
    return __pyx_r;
}
That carries nearly the same overhead as pure Python: the addition still goes through the generic PyNumber_Add call. Now with types:
/* Generated C for: def add(double x, double y): return x + y */
static PyObject *__pyx_pw_6mymod_1add(PyObject *__pyx_self, PyObject *__pyx_args) {
    double __pyx_v_x;
    double __pyx_v_y;
    /* ... parse Python args into C doubles once ... */
    return PyFloat_FromDouble(__pyx_v_x + __pyx_v_y);  // C addition
}
The addition itself is now a single native floating-point instruction (addsd on x86-64). The Python object overhead exists only at the function boundary (argument parsing and return value creation), not inside loops.
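For comparison, the interpreter's side of the same function can be inspected with the standard dis module. This is a small sketch, not Cython output: it lists the bytecode CPython runs for an untyped add, where the addition is a generic opcode that has to inspect operand types at run time.

```python
import dis


def add(x, y):
    return x + y


# Collect the opcode names CPython executes for the function body.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# The addition is a generic BINARY_* opcode (BINARY_ADD up to Python 3.10,
# BINARY_OP from 3.11 on) that dispatches on the operand types at run time.
binary_ops = [name for name in opnames if name.startswith("BINARY_")]
print(binary_ops)
```

The typed Cython version has no equivalent of this dispatch step inside the function body, which is where the saving comes from.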
Section 2: Setting Up Cython
Installation
pip install cython numpy
# Ensure C compiler is present:
# macOS: xcode-select --install
# Linux: apt install build-essential
# Windows: Visual Studio Build Tools
setup.py Approach (Classic)
# setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        name="rolling",                               # import name
        sources=["rolling.pyx"],                      # source file
        include_dirs=[np.get_include()],              # NumPy headers
        extra_compile_args=["-O3", "-march=native"],  # optimise aggressively
    )
]

setup(
    name="rolling",
    ext_modules=cythonize(
        extensions,
        annotate=True,  # generate rolling.html annotation
        compiler_directives={
            "language_level": "3",
            "boundscheck": False,  # skip array bounds checking (DANGER: only after testing)
            "wraparound": False,   # skip negative index support
            "cdivision": True,     # C division semantics (no Python ZeroDivisionError)
        },
    ),
)
Build:
python setup.py build_ext --inplace
# Creates: rolling.cpython-312-x86_64-linux-gnu.so (or similar)
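After building, it can be worth confirming that Python is importing the compiled artefact. The helper below is a minimal sketch using only the standard library; is_compiled_extension is an illustrative name, and the check simply compares the module's file name against the extension suffixes of the running interpreter.

```python
import importlib
import importlib.machinery


def is_compiled_extension(module_name):
    """Return True if the module was loaded from a compiled extension file."""
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None) or ""
    return path.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


# Suffixes this interpreter accepts for extension modules (platform dependent).
print(importlib.machinery.EXTENSION_SUFFIXES)

# json is a pure-Python package, so this prints False.
print(is_compiled_extension("json"))
```

Once the build has succeeded, is_compiled_extension("rolling") should return True when run from the directory containing the built file.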
pyproject.toml Approach (Modern)
# pyproject.toml
[build-system]
requires = ["setuptools", "cython", "numpy"]
build-backend = "setuptools.build_meta"
# Cython options such as annotate=True are passed to cythonize() in setup.py
# setup.py (still needed alongside pyproject.toml for Cython)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize(
        [
            Extension("rolling", ["rolling.pyx"],
                      include_dirs=[np.get_include()])
        ],
        compiler_directives={"language_level": "3"},
    )
)
Inline %%cython in Jupyter
For experimentation without a build system:
# In one Jupyter cell:
%load_ext Cython

# In the next cell (%%cython must be the first line of its cell):
%%cython --annotate
# cython: boundscheck=False, wraparound=False
def rolling_sum_cython(double[::1] data, int window):
    cdef int n = data.shape[0]
    cdef int i, j
    cdef double total
    # Copy of the first n - window + 1 elements, used as the output buffer
    cdef double[::1] result = data[:n - window + 1].copy()
    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] * data[i + j]
        result[i] = total
    return result
The --annotate flag produces the HTML annotation inline in the notebook.
Section 3: Type Declarations
Type declarations are the core of Cython. Without them, Cython is only slightly faster than Python. With them, the inner loops compile to plain C that is competitive with a straightforward hand-written C implementation.
Variable Types: cdef
# rolling.pyx
def compute_stats(data):
    """No type declarations - almost identical speed to Python."""
    n = len(data)
    total = 0.0
    for i in range(n):
        total += data[i]
    mean = total / n
    return mean

def compute_stats_typed(data):
    """Typed counter and accumulator - the loop itself runs in C, but
    data[i] is still a Python object lookup because data is untyped."""
    cdef int n = len(data)
    cdef int i
    cdef double total = 0.0
    cdef double mean
    for i in range(n):
        total += data[i]  # still Python object access if data is a list
    mean = total / n
    return mean
Function Types: cdef, cpdef, def
| Declaration | Callable from Python? | Callable from Cython? | Overhead |
|---|---|---|---|
| def | Yes | Yes (slow) | Full Python ABI |
| cpdef | Yes | Yes (fast) | Thin wrapper |
| cdef | No | Yes (fast) | None |
# inner.pyx
cdef double _inner_compute(double x, double y) nogil:
    """Pure C function - not callable from Python."""
    return x * x + y * y

cpdef double compute(double x, double y):
    """Callable from both Python and Cython efficiently."""
    return _inner_compute(x, y)

def compute_batch(list xs, list ys):
    """Standard Python-callable function."""
    cdef int n = len(xs)
    cdef int i
    cdef list result = [0.0] * n
    for i in range(n):
        result[i] = _inner_compute(xs[i], ys[i])
    return result
The Cython Annotation HTML - Reading Yellow Lines
Run cython -a module.pyx to generate module.html. Open it in a browser.
cython -a rolling.pyx
open rolling.html
Each line of your .pyx code is coloured:
- White: pure C - no Python interaction
- Yellow: involves Python API calls - the brighter the yellow, the more Python overhead
- Dark yellow / orange: heavy Python interaction - this is where you need to add types
Example annotation interpretation:
# Yellow (calls PyNumber_Multiply, PyObject boxing)
total += data[i] ** 2
# White (native C floating point multiply)
cdef double val = data_mv[i]
total += val * val
The annotation HTML is the most important Cython debugging tool. After adding type declarations, check that your inner loop lines have turned white.
Section 4: Typed Memoryviews - The Key to Array Performance
Typed memoryviews are the feature that makes Cython's array processing competitive with hand-written C. They provide direct C-level access to the memory of NumPy arrays, array.array, bytes, and any object exposing the buffer protocol.
Declaration Syntax
# 1D C-contiguous array of doubles
cdef double[::1] arr
# 1D strided array (any stride; in one dimension C- and Fortran-contiguous layouts coincide)
cdef double[:] arr_strided
# 2D C-contiguous (row-major) array
cdef double[:, ::1] matrix
# 2D Fortran-contiguous (column-major)
cdef double[::1, :] matrix_f
The ::1 notation means "contiguous in this dimension" - equivalent to asserting that elements are laid out sequentially in memory without gaps. This allows the compiler to generate optimal load/store instructions.
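Python's built-in memoryview exposes the same buffer metadata that a typed memoryview relies on, so contiguity can be explored without compiling anything. The sketch below uses array.array from the standard library in place of a NumPy array.

```python
from array import array

values = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

full = memoryview(values)    # every element, laid out back to back
every_other = full[::2]      # strided view: skips alternate elements

print(full.format, full.itemsize)                      # d 8
print(full.strides, full.c_contiguous)                 # (8,) True
print(every_other.strides, every_other.c_contiguous)   # (16,) False
print(every_other.tolist())                            # [1.0, 3.0, 5.0]
```

A Cython parameter declared as double[::1] would accept a buffer like full and reject one like every_other, whereas double[:] accepts both at the cost of a stride multiplication on each access.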
Matrix Multiplication Example
# matmul.pyx
# cython: boundscheck=False, wraparound=False, cdivision=True
import numpy as np
cimport numpy as cnp

def matmul_python(A, B):
    """Pure Python matrix multiply - O(n³) with full overhead."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_cython(
    double[:, ::1] A,  # C-contiguous 2D
    double[:, ::1] B,
):
    """Cython matrix multiply with typed memoryviews."""
    cdef int n = A.shape[0]
    cdef int i, j, k
    cdef double total
    cdef double[:, ::1] C = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += A[i, k] * B[k, j]
            C[i, j] = total
    return np.asarray(C)
Benchmark on 200×200 matrices:
matmul_python: 3.42s
matmul_cython: 0.021s (163x speedup)
np.dot(A, B): 0.0003s (BLAS - 11,400x vs Python)
Note: for matrix operations, NumPy's BLAS backend is still faster than Cython loops because optimised BLAS libraries use cache blocking and hand-tuned SIMD kernels (AVX2 or AVX-512 where the CPU supports them). Use Cython for operations that NumPy cannot express as a single vectorised call.
Rolling Window - A Realistic Use Case
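NumPy has no built-in rolling reduction for arbitrary functions. numpy.lib.stride_tricks.sliding_window_view exposes the windows, but applying a custom Python function to each one still runs at interpreter speed. Cython with memoryviews fills this gap: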
# rolling_stats.pyx
# cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt

def rolling_mean_std(
    double[::1] data,
    int window,
):
    """
    Compute rolling mean and standard deviation.
    Returns two arrays: means, stds.
    Welford's online algorithm - numerically stable.
    The standard deviation is the population form (divides by window).
    """
    cdef int n = data.shape[0]
    cdef int out_len = n - window + 1
    cdef double[::1] means = np.empty(out_len, dtype=np.float64)
    cdef double[::1] stds = np.empty(out_len, dtype=np.float64)
    cdef int i, j
    cdef double m, x, delta, delta2, M2
    for i in range(out_len):
        # Welford's algorithm, restarted for each window
        m = 0.0
        M2 = 0.0
        for j in range(window):
            x = data[i + j]
            delta = x - m
            m += delta / (j + 1)
            delta2 = x - m
            M2 += delta * delta2
        means[i] = m
        stds[i] = sqrt(M2 / window)  # C sqrt keeps the expression a plain double
    return np.asarray(means), np.asarray(stds)
Comparison vs pandas rolling().mean() and .std():
rolling_mean_std (Cython): 0.043s for 1M points, window=50
pandas rolling: 0.089s for 1M points, window=50
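Treat this comparison with care. pandas implements its built-in rolling mean and std in compiled code as well, so the gap does not come from Python-level aggregation (that applies only to rolling().apply() with a Python callable). The more likely sources are that this function computes both statistics in one call and skips pandas' NaN handling and index bookkeeping. The two are also not computing the same quantity: the code above returns the population standard deviation (ddof=0), while pandas defaults to the sample standard deviation (ddof=1).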
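A pure-Python reference for the same per-window Welford update is useful for testing the compiled function. The sketch below assumes the population form of the standard deviation (dividing by the window length), matching rolling_mean_std above; the names welford_mean_std and rolling_mean_std_reference are illustrative.

```python
import math
import statistics


def welford_mean_std(window_values):
    """Return (mean, population std) using Welford's single-pass update."""
    mean = 0.0
    m2 = 0.0
    for count, x in enumerate(window_values, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, math.sqrt(m2 / len(window_values))


def rolling_mean_std_reference(data, window):
    """Apply welford_mean_std to every full window of data."""
    pairs = [
        welford_mean_std(data[i:i + window])
        for i in range(len(data) - window + 1)
    ]
    means = [m for m, _ in pairs]
    stds = [s for _, s in pairs]
    return means, stds


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
means, stds = rolling_mean_std_reference(data, window=4)

# Cross-check the first window against the statistics module.
first = data[:4]
print(means[0], statistics.fmean(first))
print(stds[0], statistics.pstdev(first))
```

Comparing against statistics.pstdev rather than statistics.stdev matters here: the latter divides by window - 1 and would disagree by design.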
Section 5: GIL Release and Parallel Computation
The most powerful Cython capability for compute-intensive work is releasing the GIL and running loops in parallel across CPU threads.
with nogil: Block
Any Cython function that:
- Uses only C types (no Python objects)
- Calls only other nogil functions
...can release the GIL and allow other threads to run Python code concurrently.
# parallel.pyx
# cython: boundscheck=False, wraparound=False
from cython.parallel import prange
import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt, exp

def apply_gaussian_kernel_serial(
    double[::1] data,
    double sigma,
):
    """Serial version - processes elements one by one."""
    cdef int n = data.shape[0]
    cdef double[::1] result = np.empty(n, dtype=np.float64)
    cdef int i
    cdef double coeff = -1.0 / (2.0 * sigma * sigma)
    for i in range(n):
        result[i] = exp(coeff * data[i] * data[i])
    return np.asarray(result)

def apply_gaussian_kernel_parallel(
    double[::1] data,
    double sigma,
    int n_threads=4,
):
    """
    Parallel version - releases GIL, uses prange for OpenMP threading.
    Each element is independent → embarrassingly parallel.
    """
    cdef int n = data.shape[0]
    cdef double[::1] result = np.empty(n, dtype=np.float64)
    cdef int i
    cdef double coeff = -1.0 / (2.0 * sigma * sigma)
    with nogil:
        for i in prange(n, num_threads=n_threads, schedule='static'):
            result[i] = exp(coeff * data[i] * data[i])
    return np.asarray(result)
Build with OpenMP:
# setup.py for parallel Cython
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "parallel",
        ["parallel.pyx"],
        include_dirs=[np.get_include()],
        extra_compile_args=["-O3", "-fopenmp"],  # OpenMP flag
        extra_link_args=["-fopenmp"],
    )
]

setup(ext_modules=cythonize(extensions,
                            compiler_directives={"language_level": "3"}))
Benchmark on 10M elements:
Serial: 0.182s
Parallel (4 threads): 0.051s (3.6x speedup on 4 cores)
Thread Safety Requirements
To use prange safely, each loop iteration must be independent:
- No shared mutable state written by multiple threads simultaneously
- No Python objects (GIL is released - Python is not thread-safe without it)
- No calls into Python code inside the nogil block
Violations cause data races (silently wrong results) or segfaults.
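The independence requirement is easier to see when the partitioning is written out. The sketch below is a pure-Python model, not what OpenMP executes: static_chunks approximates how a static schedule hands each thread one contiguous block of iterations (real OpenMP runtimes may split slightly differently), and each worker writes only to its own slice of result. Under CPython's GIL these threads do not run the loop in parallel; the point is only the disjoint write pattern.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def static_chunks(n, n_threads):
    """Split range(n) into contiguous, non-overlapping (start, stop) pairs."""
    base, extra = divmod(n, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def gaussian_chunk(data, result, coeff, start, stop):
    """Worker: writes only result[start:stop], so workers never collide."""
    for i in range(start, stop):
        result[i] = math.exp(coeff * data[i] * data[i])


data = [0.0, 0.5, 1.0, 1.5, 2.0]
result = [0.0] * len(data)
coeff = -1.0 / (2.0 * 1.0 * 1.0)  # sigma = 1.0

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(gaussian_chunk, data, result, coeff, start, stop)
        for start, stop in static_chunks(len(data), 2)
    ]
    for future in futures:
        future.result()  # re-raise any worker exception

print(static_chunks(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

If a worker also wrote to result[stop], two threads would touch the same slot, which is exactly the shared-mutable-state case the list above rules out.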
Section 6: Calling C Libraries from Cython
Cython can call any C function directly without the overhead of ctypes or cffi. This is the right approach when you need to wrap a performance-critical C library.
cdef extern from - Declaring C Functions
# math_ext.pyx
import numpy as np

# Declare the C functions we want to use
cdef extern from "math.h":
    double sin(double x) nogil
    double cos(double x) nogil
    double sqrt(double x) nogil
    double fabs(double x) nogil

# Or use Cython's built-in C math declarations instead of the block above
# (pick one; declaring the same names twice in one file clashes):
# from libc.math cimport sin, cos, sqrt, fabs, M_PI

def compute_polar_to_cartesian(
    double[::1] r,
    double[::1] theta,
):
    """Convert polar coordinates to Cartesian - vectorised in C."""
    cdef int n = r.shape[0]
    cdef double[::1] x = np.empty(n, dtype=np.float64)
    cdef double[::1] y = np.empty(n, dtype=np.float64)
    cdef int i
    with nogil:
        for i in range(n):
            x[i] = r[i] * cos(theta[i])
            y[i] = r[i] * sin(theta[i])
    return np.asarray(x), np.asarray(y)
Wrapping a Custom C Function
Suppose you have a high-performance C library:
// fast_filter.h
double fast_ema(const double *data, int n, double alpha);
# fast_filter.pyx
cdef extern from "fast_filter.h":
    double fast_ema(const double *data, int n, double alpha) nogil

def exponential_moving_average(
    double[::1] data,
    double alpha,
):
    """
    Call into a C library function directly.
    The data memoryview provides a direct pointer to the array's memory.
    """
    cdef int n = data.shape[0]
    # &data[0] gives the pointer to the first element
    return fast_ema(&data[0], n, alpha)
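When wrapping a C function like this, a slow pure-Python reference makes the wrapper testable. The text does not show what fast_ema computes, so the sketch below is an assumption: it takes the common recurrence ema = alpha * x + (1 - alpha) * ema, seeded with the first sample, and returns the final value. ema_reference is an illustrative name, and the real library may use a different seed or convention.

```python
def ema_reference(data, alpha):
    """Pure-Python EMA under ASSUMED semantics (fast_ema itself is not shown)."""
    if len(data) == 0:
        # A wrapper should reject empty input too: &data[0] is invalid there.
        raise ValueError("data must not be empty")
    ema = data[0]  # assumption: seeded with the first sample
    for x in data[1:]:
        ema = alpha * x + (1.0 - alpha) * ema
    return ema


print(ema_reference([1.0, 2.0, 3.0], 1.0))  # 3.0: alpha = 1 keeps only the last sample
```

The empty-input guard is worth mirroring in exponential_moving_average, since taking the address of the first element of an empty buffer is not valid.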
Section 7: ctypes - Calling C Without Compilation
ctypes is the standard library solution for calling into shared C libraries from Python. No Cython compilation step required. Slower than Cython at the boundary but fine for infrequent calls into fast C functions.
import ctypes
import ctypes.util
import numpy as np
from pathlib import Path

# Load a shared library
libm = ctypes.CDLL(ctypes.util.find_library("m"))  # libm - C math library

# Declare the function signature
libm.sin.argtypes = [ctypes.c_double]
libm.sin.restype = ctypes.c_double

# Call it
result = libm.sin(3.14159 / 2.0)
print(f"sin(π/2) = {result:.6f}")  # 1.000000

# Loading your own library
lib = ctypes.CDLL(str(Path(__file__).parent / "libfast.so"))
lib.fast_ema.argtypes = [
    ctypes.POINTER(ctypes.c_double),  # const double *data
    ctypes.c_int,                     # int n
    ctypes.c_double,                  # double alpha
]
lib.fast_ema.restype = ctypes.c_double

def ema_ctypes(data: np.ndarray, alpha: float) -> float:
    """Call fast_ema via ctypes - no compilation step."""
    assert data.dtype == np.float64 and data.flags['C_CONTIGUOUS']
    ptr = data.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    return lib.fast_ema(ptr, len(data), alpha)
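The same zero-copy pointer can be obtained without NumPy. The sketch below uses array.array and ctypes from the standard library; as_double_pointer is an illustrative helper, and no shared library is loaded, so the pointer is exercised directly from Python to show that it aliases the array's memory.

```python
import ctypes
from array import array


def as_double_pointer(buf):
    """Return (pointer, length) for a writable array('d') without copying."""
    if buf.typecode != "d":
        raise TypeError("expected array('d')")
    n = len(buf)
    c_array = (ctypes.c_double * n).from_buffer(buf)  # shares memory with buf
    return ctypes.cast(c_array, ctypes.POINTER(ctypes.c_double)), n


samples = array("d", [1.0, 2.0, 3.0])
ptr, n = as_double_pointer(samples)

ptr[1] = 20.0          # write through the C pointer
print(list(samples))   # [1.0, 20.0, 3.0]
```

The returned pointer could be passed to a function declared with ctypes.POINTER(ctypes.c_double) in its argtypes, in the same way as the NumPy-based pointer above.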
ctypes for Structures
import ctypes

class ImageHeader(ctypes.Structure):
    """Match a C struct layout."""
    _fields_ = [
        ("width", ctypes.c_uint32),
        ("height", ctypes.c_uint32),
        ("channels", ctypes.c_uint8),
        ("depth", ctypes.c_uint8),
    ]

header = ImageHeader(width=1920, height=1080, channels=3, depth=8)
print(f"Image: {header.width}x{header.height}, {header.channels}ch")
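Matching a C struct also means matching its padding, which ctypes can report. The sketch below redefines the same ImageHeader so that it runs on its own, then inspects field offsets and total size and round-trips the structure through raw bytes.

```python
import ctypes


class ImageHeader(ctypes.Structure):
    _fields_ = [
        ("width", ctypes.c_uint32),
        ("height", ctypes.c_uint32),
        ("channels", ctypes.c_uint8),
        ("depth", ctypes.c_uint8),
    ]


header = ImageHeader(width=1920, height=1080, channels=3, depth=8)

# Field offsets follow the platform's default C alignment rules.
print(ImageHeader.width.offset, ImageHeader.height.offset)    # 0 4
print(ImageHeader.channels.offset, ImageHeader.depth.offset)  # 8 9

# 10 bytes of fields are rounded up to a multiple of the 4-byte alignment.
print(ctypes.sizeof(ImageHeader))  # 12

raw = bytes(header)                           # serialise to raw bytes
restored = ImageHeader.from_buffer_copy(raw)  # rebuild from those bytes
print(restored.width, restored.height)        # 1920 1080
```

If the C side declares the struct as packed, the Python definition has to request the same packing, otherwise the two sides will disagree about the size.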
Section 8: cffi - The Modern C Integration API
cffi (C Foreign Function Interface) is more ergonomic than ctypes for complex C APIs. It parses actual C declarations, which makes it a good fit for wrapping large C libraries. It does not parse C++ headers, so C++ code needs an extern "C" wrapper first.
pip install cffi
ABI Mode (No Compilation)
from cffi import FFI
import numpy as np

ffi = FFI()

# Declare the C functions exactly as in the header
ffi.cdef("""
    double fast_ema(const double *data, int n, double alpha);
    int process_batch(
        const double *input,
        double *output,
        int n,
        double threshold
    );
""")

# Load the shared library
lib = ffi.dlopen("./libfast.so")

def ema_cffi(data: np.ndarray, alpha: float) -> float:
    """Call fast_ema via cffi - cleaner than ctypes for complex APIs."""
    # The C code reads raw doubles, so dtype and layout must both match
    assert data.dtype == np.float64 and data.flags['C_CONTIGUOUS']
    # cffi can cast a numpy array's buffer address to a C pointer
    c_data = ffi.cast("double *", data.ctypes.data)
    return lib.fast_ema(c_data, len(data), alpha)

def process_batch_cffi(input_arr: np.ndarray, threshold: float) -> np.ndarray:
    """Demonstrates in/out array pattern."""
    # Force a contiguous float64 buffer before handing its address to C
    input_arr = np.ascontiguousarray(input_arr, dtype=np.float64)
    n = len(input_arr)
    output_arr = np.empty(n, dtype=np.float64)
    c_in = ffi.cast("double *", input_arr.ctypes.data)
    c_out = ffi.cast("double *", output_arr.ctypes.data)
    result_code = lib.process_batch(c_in, c_out, n, threshold)
    if result_code != 0:
        raise RuntimeError(f"process_batch failed with code {result_code}")
    return output_arr
API Mode (With Compilation - Fastest)
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    double fast_ema(const double *data, int n, double alpha);
""")
ffi.set_source(
    "_fast_lib",  # output module name
    """
    #include "fast_filter.h"
    """,
    sources=["fast_filter.c"],
    extra_compile_args=["-O3", "-march=native"],
)

if __name__ == "__main__":
    ffi.compile(verbose=True)
API mode compiles the C code and links it into a Python extension module. The resulting _fast_lib.so is importable and calls the C function with near-zero overhead.
Section 9: When NOT to Use Cython
Cython adds build complexity. It requires a C compiler, complicates CI/CD pipelines, and makes the code less accessible to contributors who do not know Cython syntax. Before reaching for it:
Decision Matrix
| Situation | Use Cython? | Better Alternative |
|---|---|---|
| Numerical loop over NumPy arrays | Maybe | Numba @njit first - simpler |
| Custom operation NumPy cannot express | Yes | Cython typed memoryviews |
| Wrapping an existing C/C++ library | Yes | Cython or cffi |
| String/text processing bottleneck | No | Python re, regex, or Rust |
| I/O bottleneck (disk, network, database) | No | asyncio or better algorithm |
| Algorithm is O(n²), should be O(n log n) | No | Fix the algorithm first |
| Library already in NumPy/scipy/pandas | No | Use the library |
| One-time data transformation (not in hot path) | No | Not worth the complexity |
| Bottleneck is < 5% of total runtime | No | Profile better targets |
The Cython Complexity Budget
Each .pyx file adds to your project's complexity budget:
- CI must compile Cython before running tests
- Wheels must be built for each Python version and platform (or require compilation on install)
- Stack traces from .pyx files are harder to read
- Debugging requires understanding both Python and C error domains
Rule of thumb: Cython is worth the complexity budget when the speedup is 10x or greater and the bottleneck accounts for at least 10% of total runtime. For smaller gains, prefer Numba (zero build complexity) or NumPy vectorisation.
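The reasoning behind the 10% threshold follows from Amdahl's law: speeding up a fraction p of the runtime by a factor s gives an overall speedup of 1 / ((1 - p) + p / s). A short sketch of that arithmetic, with overall_speedup as an illustrative name:

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


for fraction in (0.05, 0.10, 0.50, 0.90):
    gain = overall_speedup(fraction, 10)
    print(f"{fraction:.0%} of runtime made 10x faster -> {gain:.2f}x overall")
```

A 10x speedup on a bottleneck that is 10% of the runtime yields only about 1.10x overall, while the same speedup on a 90% bottleneck yields about 5.26x, which is why profiling comes before reaching for Cython.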
Section 10: Compiler Directives Reference
These directives control Cython's safety vs. performance tradeoffs. Enable them in the file header or globally in setup.py:
# At the top of any .pyx file:
# cython: boundscheck=False, wraparound=False, cdivision=True, nonecheck=False
| Directive | Default | Performance Effect | Safety Cost |
|---|---|---|---|
| boundscheck=False | True | 5–30% speedup | Out-of-bounds access = segfault |
| wraparound=False | True | 2–10% speedup | Negative indexing silently wrong |
| cdivision=True | False | 3–15% speedup | Division by zero = C UB, not ZeroDivisionError |
| nonecheck=False | False | No change (already the default) | None access = segfault |
| initializedcheck=False | True | 2–5% speedup | Uninitialised memoryview = segfault |
| language_level=3 | 2 in Cython 0.29, 3str in Cython 3 | None (selects Python 3 source semantics) | Python 2 semantics if left unset on Cython 0.29 |
Safety protocol: develop with all safety checks enabled (defaults). Only disable them after tests pass, and only for functions that have been verified correct.
Interview Questions
Q1: What is the difference between cdef, cpdef, and def in Cython? When would you use each?
def creates a standard Python function. It is callable from Python with the full Python calling convention - arguments are Python objects, the return value is a Python object. Inside the function, Cython can use cdef variable types to speed up local computation, but the function entry/exit pays Python overhead.
cdef creates a C-level function that is NOT callable from Python. It accepts and returns C types directly, has zero Python calling overhead, and can be declared nogil. Use cdef for internal helper functions called from within .pyx code that you never need to call directly from Python.
cpdef creates both a C version and a Python wrapper. When called from Cython, the C version is used (fast path). When called from Python, the Python wrapper is used. Use cpdef for functions that need to be both performance-critical when called from Cython AND accessible from Python (e.g., module-level API functions that also call each other internally).
Q2: What is a typed memoryview and why does it enable such large speedups over Python list access?
A typed memoryview is a Cython construct that wraps any object implementing the Python buffer protocol (NumPy arrays, bytearray, array.array, etc.) and provides direct C-level pointer access to the underlying memory.
Without memoryviews, accessing data[i] in a Cython function that receives a Python list involves:
- A call to PyList_GetItem(data, i)
- A bounds check
- Returning a PyObject*
- Unboxing the PyObject to extract the C value
With a typed memoryview double[::1] data, accessing data[i] compiles to a plain C array access, ((double *) data.data)[i], because the ::1 declaration fixes the stride at sizeof(double). A strided double[:] view adds one multiplication by the byte stride: *(double *)(data.data + i * data.strides[0]). Either way the cost is approximately one memory load instruction.
For a loop over 10 million elements, the difference is 10M Python API calls versus 10M C memory loads. At a rough ~50ns per Python-level access versus ~1ns per memory load, that is a gap of up to about 50x on element access alone, before the C compiler applies any further optimisation to the typed loop.
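The boxing cost also shows up in memory. The sketch below compares the size of one Python float object with the size of one slot in a packed array of C doubles, using only the standard library; the exact object size is a CPython implementation detail.

```python
import sys
from array import array

floats_as_list = [float(i) for i in range(1000)]
floats_as_array = array("d", floats_as_list)

per_object = sys.getsizeof(floats_as_list[0])  # one boxed float object
per_slot = floats_as_array.itemsize            # one raw C double

# Typically 24 and 8 on 64-bit CPython.
print(per_object, per_slot)
```

A list additionally stores a pointer to each float object, so reading one element means following a pointer and then unboxing, whereas the packed buffer holds the value itself.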
Q3: What are the boundscheck and wraparound Cython directives? When is it safe to disable them?
boundscheck=True (the default) causes Cython to insert an if i < 0 or i >= n check before every array access. This prevents out-of-bounds access from silently corrupting memory - instead you get an IndexError. The cost is one comparison and conditional branch per access - typically 5–30% overhead.
wraparound=True (the default) supports Python's negative indexing convention: data[-1] accesses the last element. Cython implements this by checking for negative indices and adjusting them before the access. Disabling it makes negative indexing produce incorrect results without an error.
It is safe to disable both when:
- The function has been thoroughly tested with the defaults enabled
- All loop bounds are provably within range (e.g., for i in range(n) where n = data.shape[0])
- No negative indices are used anywhere in the function
The typical production workflow: develop with defaults, run tests, then add # cython: boundscheck=False, wraparound=False to the file header and verify tests still pass. If a test passes with the defaults but fails once they are disabled, the code most likely relies on negative indexing, which wraparound=True was silently supporting.
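The effect of the two directives can be modelled in a few lines of Python. This is a sketch of the semantics only, not the C that Cython generates; resolve_index is an illustrative name that returns the element offset an access data[i] would end up using.

```python
def resolve_index(i, n, boundscheck=True, wraparound=True):
    """Model which element offset an access data[i] resolves to."""
    if wraparound and i < 0:
        i += n  # Python-style negative indexing
    if boundscheck and not 0 <= i < n:
        raise IndexError("index out of bounds")
    return i  # with both checks off this may lie outside [0, n)


print(resolve_index(-1, 5))                                       # 4
print(resolve_index(-1, 5, boundscheck=False, wraparound=False))  # -1
```

In compiled code an offset of -1 is not reported; it reads or writes memory just before the buffer, which is why the checks should only be disabled once the indexing is known to stay in range.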
Q4: How does prange work and what are the requirements for a loop to be safe to parallelise with it?
prange is Cython's parallel range, built on OpenMP. It distributes iterations of a loop across multiple threads, each with its own stack. The GIL must be released before prange is called (typically inside a with nogil: block).
Requirements for safe parallelisation with prange:
- No Python objects: the loop body cannot create, access, or modify Python objects. The GIL is released.
- No cross-iteration data hazards: if iteration i reads a location that iteration j might be writing at the same time, the result is undefined. Each iteration should read from one region and write to a disjoint one. result[i] = f(data[i]) is safe, and so is result[i] = data[i-1] + data[i+1], because data is only read. The in-place form data[i] = data[i-1] + data[i+1] is not safe, since its outcome depends on thread scheduling.
- Reductions must use in-place operators: prange has no explicit reduction clause. When the loop body contains an in-place update such as total += data[i], as in for i in prange(n, nogil=True): total += data[i], Cython infers that total is a reduction variable, gives each thread a private copy, and combines the copies at the end.
- No dynamic allocation inside the loop unless thread-safe: malloc/free in the loop body is generally safe; Python allocations are not (GIL is released).
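The reduction pattern can be written out by hand to show what happens to the accumulator. The sketch below is a pure-Python model using threads from the standard library, not Cython or OpenMP code: each worker accumulates into its own private total and the partial results are combined once at the end. The names partial_sum and chunked_sum are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(data, start, stop):
    """Each worker accumulates into its own private total."""
    total = 0.0
    for i in range(start, stop):
        total += data[i]
    return total


def chunked_sum(data, n_threads=4):
    """Sum data by combining per-thread partial sums (the reduction step)."""
    n = len(data)
    bounds = [
        (t * n // n_threads, (t + 1) * n // n_threads)
        for t in range(n_threads)
    ]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda b: partial_sum(data, *b), bounds)
        return sum(partials)


print(chunked_sum([1.0] * 1000))  # 1000.0
```

Because floating-point addition is not associative, a parallel reduction can differ from the serial sum in the last bits, so tests should compare with a tolerance.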
Q5: When should you prefer cffi over ctypes for calling C code from Python? When should you prefer Cython over both?
ctypes is the right choice for simple, infrequent calls into a C library when you cannot modify the build system. It ships with Python (no extra dependencies) and works without compiling anything. The API becomes unwieldy for complex C structures and function pointers.
cffi is preferred when: the C API is complex (many structs, callbacks, or pointer-heavy interfaces), you want to paste actual C header declarations instead of manually mirroring the types in Python, or you need API mode (compile-time linking) for maximum call speed. cffi is also the standard approach for PyPy compatibility.
Prefer Cython over both when the bottleneck is a loop you would otherwise write in Python, not an existing C function. ctypes and cffi only let you call code that is already compiled: each call still pays a Python-to-C boundary cost, and any Python code around the call runs at interpreter speed. Cython compiles the entire function, including the loop interior, to C. If the loop already lives inside the C function, a single ctypes or cffi call is perfectly adequate. If the loop lives in Python and calls a small C function once per element, a Cython version with typed memoryviews will significantly outperform it, because it removes both the per-call overhead and the Python overhead inside the loop.
